REMARKS

Applicants gratefully acknowledge Examiner Vicary for courtesies extended during a 
telephone interview dated December 15, 2008, including co-inventor Dr. Gustavson. 

During this interview, Dr. Gustavson explained how the present invention differed 
from the co-pending application used for the double patenting rejection. Dr. Gustavson also 
explained how the prior art was not directed to SIMD k > 1 machines. The Examiner 
explained how he considered a major issue to be his ability to broadly interpret the claim 
language in a manner probably not intended by Applicants. 

Applicants indicated that they would attempt to provide claim language that would 
more narrowly define the invention. 

Dr. Gustavson also explained how the present invention relied upon the register block 
format, as demonstrated, for example, by the description at line 7 of page 17, and as more
fully described in co-pending application, US Patent Application S/N 10/671,888, from 
which co-pending application various aspects and descriptions were incorporated into the 
present application by the previous amendment. Dr. Gustavson also indicated how various 
paragraphs of the current rejection (e.g., paragraphs 54 and 62) were resolved by clarifying 
the register block format in the context of the present invention. Finally, Dr. Gustavson again 
described how the prior art did not address SIMD machines (e.g., SIMD, k > 1). 

Claims 1-9 and 11-19 are all the claims presently pending in the application. Claims 
10 and 20 are canceled. 

It is noted that Applicants specifically state that no amendment to any claim herein, if 
any, should be construed as a disclaimer of any interest in or right to an equivalent of any 
element or feature of the amended claim. 

The Examiner objects to claim 1 for an informality. Applicants believe that the above 
claim amendments appropriately address this concern and request that the Examiner 
reconsider and withdraw this objection. 

Claims 1-9 and 11-19 stand rejected under non-statutory double patenting over claims 
1, 3-6, 8-12, and 14-19 of co-pending application S/N 10/671,937, further in view of 
"Superscalar GEMM-based Level 3 BLAS" to co-inventor Gustavson et al. 

Claims 1-9 and 11-19 stand rejected under 35 U.S.C. § 112, first paragraph, as failing 
to comply with the written description requirement. 

Claims 1-9 and 11-19 stand rejected under 35 U.S.C. § 112, second paragraph, as
indefinite.

Claims 1-9 and 11-19 stand rejected under 35 U.S.C. § 102(b) as anticipated by
Gustavson et al., "Superscalar GEMM-based Level 3 BLAS - The On-going Evolution of a 
Portable and High-Performance Library." 

These rejections are respectfully traversed in the following discussion. 

I. THE CLAIMED INVENTION 

The claimed invention is directed to a method of executing a linear algebra subroutine 
on a machine having at least one floating point unit (FPU) with one or more associated 
load/store units (LSU) to load multiple-word (e.g., SIMD k > 1) data into and out of floating
point registers (FRegs) of the FPU by way of an L1 cache.

For an execution code controlling an operation of said floating point unit (FPU) 
performing a linear algebra subroutine execution, instructions are inserted to move data into a 
cache providing data for the FPU so that the LSUs can then move the data into the FRegs 
before it is scheduled to be used by said linear algebra subroutine execution. The data being
prefetched into the cache from memory is in a format predetermined to reduce a number of
data streams for a level 3 linear algebra processing to be three streams and to allow a multiple
loading of these streams into said FPU by said LSU.

The format comprises a register block format wherein data is stored in blocks of size 
p-by-q, where p and q are small integers so that the pieces of these blocks can be fitted into 
the FRegs. The three data streams comprise data of one matrix of the level 3 linear algebra 
processing as considered to be resident in the cache and the two remaining matrix operands 
of the level 3 linear algebra processing as residing in a cache level higher than the cache.
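
By way of illustration only, the register block format summarized above can be sketched as a repacking of a conventionally stored matrix into contiguous p-by-q blocks. This is a minimal sketch and not the implementation of the present application: the function name pack_register_blocks, the column-major input, the order in which blocks are emitted, and the column-major order inside each block are all assumptions made for the example.

```python
def pack_register_blocks(a, m, n, p, q):
    """Repack a flat column-major m-by-n matrix into contiguous p-by-q blocks.

    Assumptions of this sketch (not taken from the application):
    - element (i, j) of the input lives at a[i + j * m] (column-major);
    - p divides m and q divides n;
    - blocks are emitted block-column by block-column, and elements inside
      a block are emitted column by column.
    """
    if m % p or n % q:
        raise ValueError("this sketch assumes p divides m and q divides n")
    packed = []
    for jb in range(0, n, q):              # first column of the current block
        for ib in range(0, m, p):          # first row of the current block
            for j in range(jb, jb + q):    # columns inside the block
                for i in range(ib, ib + p):  # rows inside the block
                    packed.append(a[i + j * m])
    return packed


# A 4-by-4 matrix whose entries equal their column-major offsets.
a = list(range(16))
packed = pack_register_blocks(a, 4, 4, 2, 2)
# Each run of p * q = 4 consecutive entries of `packed` is one 2-by-2 block.
first_block = packed[:4]   # [0, 1, 4, 5]: rows 0-1 of columns 0-1
```

In such a layout every block occupies p * q consecutive memory locations, so a block can be read with stride-one loads without reference to the leading dimension of the original matrix.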

Conventional compilers do not have the capability to automatically pre-fetch (timely 
move) data into the FPU for Level 3 Dense Linear Algebra Subroutines, particularly in view 
of the newer architectures having FPUs and (SIMD k > 1) LSUs.

The claimed invention, on the other hand, teaches how to timely load data into cache, 
using a non-standard format predetermined to allow the minimum of three data streams and 
to allow multiple loading of words stride one (SIMD k > 1) into the FPU. This feature can 
also be accomplished by conventional compilers, when modified to incorporate the concepts 
of the present invention. 






Serial No. 10/671,889 

Docket No. YOR920030170US1 (YOR.464) 



II. THE DOUBLE PATENT REJECTION 

The Examiner continues to consider that all pending claims of the present application 
are obvious over the claims of co-pending application 10/671,937. 

As indicated during the above-mentioned telephone interview, the seven co-pending 
applications listed at the beginning of the present application (including the present 
application) are considered as capable of working together to improve efficiency on the 
newer SIMD machines in a synergistic manner. As also indicated during the telephone 
interview, all seven of these co-pending applications were filed on the same day so that there 
would be no difference in patent term, excluding possible term extension differences, thereby 
rendering a terminal disclaimer moot.

More important, as indicated during the telephone interview, Applicants consider 
these seven applications as patentably distinct improvements in the art. As such, each of the 
seven applications has independent claims specifically intended to cover the perceived
novelty of that application alone, even if various dependent claims might cover subject matter
of another co-pending application, given the potential for synergy with another technique. 
Thus, the various claim sets of these seven co-pending applications clearly cover distinctly 
different scopes, when viewed from the perspective of the different independent claims. 

Of particular relevance to the alleged double patenting of the rejection currently of 
record, Applicants bring to the Examiner's attention that the preloading described in
co-pending application 10/671,937 is actually an alternative to the prefetching method of the
present invention. That is, the prefetching of the present invention uses the nonstandard 
format of register blocking, which has placed the matrix data into a contiguous format of 
blocks of data designed for loading onto the FPUs. Therefore, the prefetching of the present 
invention does not subsequently utilize the technique of preloading described in 10/671,937. 
Rather, the preloading of 10/671,937 is an alternate method to overcome a penalty of one or
more cycles (a 5-cycle penalty was discussed in the disclosure for an older machine) associated
with the cache/FPU loading of the newer machines. The FPU has associated loading
instructions. Thus, incorrectly loaded data can be re-arranged inside the FPU to be in a
correct format. See the description in co-pending application YOR920030169 (US Patent S/N
10/671,888), where Applicants describe two errors correcting each other via the pseudo matrix
concept.

The present invention achieves the same effect as the preloading described in
10/671,937 because of the use of the register block format in the prefetching.
Re-arrangement of data is not required in the FPU as it is already in the correct format for FPU
processing.

Thus, contrary to the Examiner's position, the prefetching and preloading described 
respectively in the present application and in copending application 10/671,937 are not 
sequential processes as described in the rejection currently of record. Rather, they are 
alternate processes and, therefore, clearly patentably distinct. 

Turning now to the first full paragraph on page 5 of the present rejection, the 
Examiner alleges that the Gustavson article in §3.1 describes a non-standard format for 
prefetching: 

"... Gustavson discloses, said data being prefetched into cache from a memory in a 
nonstandard format (section 3.1, first indented paragraph of page 210, technique of keeping 
a small square block of C in registers; this technique of prefetching C in the format of a small
square block as opposed to the prefetching of A and B can be considered to nonstandard)" 

In response, Applicants bring to the Examiner's attention that the description in §3.1 
does not make any suggestion whatsoever about the format used for prefetching, since it 
merely describes keeping a small block of C data in the registers, which is an entirely 
different concept from that of describing the format used to prefetch the small block of C data 
kept in these registers. There is no suggestion in this article to use a non-standard format for 
prefetching data into the cache and the Examiner's interpretation of this section in Gustavson 
is clear evidence of improper hindsight, since the Examiner is clearly attempting to extend 
the activity within the FPU registers as indicative of the format used to prefetch data from 
memory into cache. 

There is no correspondence whatsoever in the prior art between the format of data in 
the FPU register set and the retrieval of data from memory. As Applicants keep pointing out, 
the conventional storage/retrieval of data is based entirely upon the standard for each 
computer language, either row major or column major. 

Relative to the Examiner's allegation that Gustavson's article described three streams 
for the level 3 processing, Applicants explained during the above-mentioned telephone 
interview that pre-SIMD machines were not capable of reducing the data streams down to 
only three streams. Rather, for each of the three operands, there were pluralities of streams 
dependent upon the size of the data blocks for each operand.
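
For illustration only, the dependence of the number of streams on block size can be sketched by counting stride-one runs of memory offsets. This is a simplified model rather than a description of any particular machine: treating each maximal run of consecutive offsets as one "stream" is an assumption of the sketch, as are the helper names contiguous_runs, block_offsets_column_major, and block_offsets_register_block and the leading dimension of 100.

```python
def contiguous_runs(offsets):
    """Count maximal runs of consecutive (stride-one) offsets, in access order."""
    runs = 0
    prev = None
    for off in offsets:
        if prev is None or off != prev + 1:
            runs += 1          # a jump in address starts a new run
        prev = off
    return runs


def block_offsets_column_major(ib, jb, p, q, m):
    """Offsets of the p-by-q block at (ib, jb) in a column-major matrix
    with leading dimension m (hypothetical helper for this sketch)."""
    return [i + j * m for j in range(jb, jb + q) for i in range(ib, ib + p)]


def block_offsets_register_block(block_index, p, q):
    """Offsets of one block when every p-by-q block is stored contiguously
    (hypothetical helper for this sketch)."""
    start = block_index * p * q
    return list(range(start, start + p * q))


# A 4-by-4 block of a column-major matrix with leading dimension 100
# touches one separate run per column of the block.
standard = contiguous_runs(block_offsets_column_major(0, 0, 4, 4, 100))   # 4
# The same amount of data stored as one contiguous block is a single run.
blocked = contiguous_runs(block_offsets_register_block(3, 4, 4))          # 1
```

Under this model the run count for the standard layout grows with the number of columns q in the block, whereas the contiguous block layout yields one run per operand regardless of p and q.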

In paragraph 43 on page 17 of the Office Action, the Examiner concedes to
Applicants that prefetching data into cache would be different from preloading data from
cache into the FPU registers but insists in the final sentence that "... pre-loading can 
nevertheless necessitate pre-fetching as well (which does not mean that they are not the 
same)." 

In response, Applicants again point out that prefetching and preloading are alternative
methods that solve a penalty of one or more cycles (a 5-cycle penalty was described by way of
example in the specification) at the cache/FPU interface of the newer architectures.

Therefore, contrary to the Examiner's rationale in the rejection of record, Applicants 
respectfully submit that the preloading of matrix data in the Gustavson article, as presuming 
that a prefetching of that data precedes the preloading, would not render the present 
prefetching obvious over the co-pending application related to preloading since the machine 
in the Gustavson article used prefetching and preloading in a different context so that they 
were using sequential loading operations. 

In view of the above, Applicants again submit that the preloading technique of
co-pending application S/N 10/671,937 (e.g., YOR920030171US1) does not render obvious the
prefetching technique of the present application S/N 10/671,889 (e.g., YOR920030170US1),
further in view of the prefetching/preloading techniques described in Gustavson's previous
article.

Therefore, the Examiner is again respectfully requested to reconsider and withdraw 
this rejection. 

III. THE 35 USC §112, FIRST PARAGRAPH, REJECTIONS 

Claims 1-9 and 11-20 stand rejected under 35 U.S.C. §112, first paragraph, as
allegedly failing the written description requirement. Applicants believe that the above claim 
amendments, consistent with Applicants' understanding during the above-mentioned 
telephone interview, and the above specification amendments incorporating additional 
description from co-pending application S/N 10/671,888 (IBM docket YOR920030169US1) 
on the register block format and from co-pending S/N 10/671,935 (IBM docket 
YOR920030330US1) on six level 3 L1 kernel routines, appropriately address some of the
concerns in this rejection. It is noted that SIMD is specifically discussed
beginning at paragraph [0081] of co-pending application 10/671,888 related to the register
block format.

In view of the above, Applicants respectfully request that the Examiner reconsider 
and withdraw this rejection. 

IV. THE 35 USC §112, SECOND PARAGRAPH, REJECTION 

The Examiner objects to various terms in the claims. In view of the discussion in the 
above-mentioned telephone interview, Applicants believe the above claim amendments 
appropriately address the Examiner's concerns. 

In view of the foregoing, the Examiner is respectfully requested to reconsider and 
withdraw this rejection. 

V. THE PRIOR ART REJECTION 

The Examiner alleges that the article "Superscalar GEMM-based Level 3 BLAS - The
On-going Evolution of a Portable and High-Performance Library" (Para '98, pages 207-215),
to co-inventor Gustavson et al., teaches the claimed invention.

In response, Applicants submit that this paper refers only to multiple loads of load 
multiple type k = 1. The present application addresses architectures capable of a SIMD load
with k > 1, and this paper by Gustavson et al., does not cover the specific situation that the 
present invention can address for the newer architectures. 

Moreover, the independent claims of the present application also refer to non-standard 
format used when pre-fetching data from memory into L1 cache. This aspect of independent
claim 1 is not present in this paper. In paragraph 3 on page 12 of the Office Action, the 
Examiner relies upon the description in section 3.1, first indented paragraph of page 210 of 
this paper, presumably meaning the following:

"This technique, to keep a small square block of C in registers and replace entries of
A and B between consecutive iterations of the innermost loop, maximizes the ratio between
the number of MAAs and the number of load and store instructions, used to transfer data to
and from registers, i.e., #MAAs/(#LOADs + #STOREs) is maximized."

However, Applicants respectfully disagree with the Examiner that one having
ordinary skill in the art would agree with the Examiner's characterization that this description
has anything at all to do with data format, let alone data format used for data moved into L1
cache. Indeed, Applicants respectfully submit that these words have no suggestion 
whatsoever as to whether the data anywhere in this sentence is in any specific format, since 
the mechanism is not described as dependent upon any such specific data format. 

Therefore, Applicants submit that there is no suggestion in co-inventor Gustavson's 
cited paper concerning non-standard format for data transfer between memory and cache, as 
required by the plain meaning of the claim language. 

Hence, turning to the clear language of the claims, in Gustavson there is no teaching 
or suggestion of: "... for an execution code controlling an operation of said floating point 
unit (FPU) performing a linear algebra subroutine execution, inserting instructions to move 
data in a contiguous and stride one format either into a cache providing data for said FPU for 
direct loading into said FPU, so that said LSUs can load said data into said FRegs before it is 
scheduled to be used in said linear algebra subroutine execution, said data being prefetched
into said cache from a memory in a register block format predetermined to reduce a number
of data streams for a level 3 nested loop matrix-matrix type kernel type operation processing
to be three streams and to allow a loading of these streams into said FPU by said LSU, said
register block format comprising a data storage format wherein data is stored in blocks of size 
p-by-q where p and q are small integers so that the pieces of these blocks can be fitted into 
said FRegs, and wherein said three data streams comprise data of one matrix of said level 3 
processing as considered to be resident in said cache and data for two remaining matrix 
operands of said level 3 processing as residing in a cache level higher than said cache", as
required by independent claim 1. The remaining independent claims have similar language. 

In view of the above, the Examiner is respectfully requested to withdraw this 
rejection. 

VI. FORMAL MATTERS AND CONCLUSION 

In view of the foregoing, Applicants submit that claims 1-9 and 11-19, all the claims
presently pending in the application, are patentably distinct over the prior art of record and 
are in condition for allowance. The Examiner is respectfully requested to pass the above 
application to issue at the earliest possible time.

Should the Examiner find the application to be other than in condition for allowance, 
the Examiner is requested to contact the undersigned at the local telephone number listed 
below to discuss any other changes deemed necessary in a telephonic or personal interview.

The Commissioner is hereby authorized to charge any deficiency in fees or to credit 
any overpayment in fees to Assignee's Deposit Account No. 50-0510. 

Respectfully Submitted, 



Date: January 21, 2009




Frederick E. Cooperrider 
Registration No. 36,769 



McGinn Intellectual Property Law Group, PLLC 

8321 Old Courthouse Road, Suite 200 
Vienna, VA 22182-3817 
(703) 761-4100 
Customer No. 21254 



CERTIFICATION OF TRANSMISSION 

I certify that I transmitted electronically, via EFS, this Amendment under 37 CFR 
§1.111 to the USPTO on January 21, 2009. 




Frederick E. Cooperrider (Reg. No. 36,769) 